Learning Word Vectors for 157 Languages

Authors

  • Edouard Grave
  • Piotr Bojanowski
  • Prakhar Gupta
  • Armand Joulin
  • Tomas Mikolov
Abstract

Distributed word representations, or word vectors, have recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. A key ingredient in the successful application of these representations is to train them on very large corpora, and to use these pre-trained models in downstream tasks. In this paper, we describe how we trained such high-quality word representations for 157 languages. We used two sources of data to train these models: the free online encyclopedia Wikipedia and data from the Common Crawl project. We also introduce three new word analogy datasets to evaluate these word vectors, for French, Hindi and Polish. Finally, we evaluate our pre-trained word vectors on 10 languages for which evaluation datasets exist, showing very strong performance compared to previous models.
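The vectors described in the abstract were released publicly (the cc.<lang>.300 models on fasttext.cc). As a minimal sketch, assuming the official fasttext Python bindings (pip install fasttext) and the published file-naming convention, this is how one of the models can be loaded and queried, including an analogy query by vector arithmetic in the style of the analogy evaluation mentioned above:

    import fasttext
    import fasttext.util

    # Download the pre-trained French model once (saved as cc.fr.300.bin).
    fasttext.util.download_model('fr', if_exists='ignore')
    model = fasttext.load_model('cc.fr.300.bin')

    # Character n-gram (subword) information yields vectors even for rare
    # or unseen words.
    vec = model.get_word_vector('roi')               # 300-d numpy array
    print(model.get_nearest_neighbors('roi', k=5))   # (score, word) pairs

    # Analogy by vector arithmetic: roi - homme + femme is expected to be
    # close to reine.
    print(model.get_analogies('roi', 'homme', 'femme', k=1))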

Similar articles

Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders

Current approaches to learning vector representations of text that are compatible between different languages usually require some amount of parallel text, aligned at the word, sentence or at least document level. We hypothesize, however, that different natural languages share enough semantic structure that it should be possible, in principle, to learn compatible vector representations just by analy...

The Effect of Mnemonic Key Word Method on Vocabulary Learning and Long Term Retention

Most of the studies on the key word method of second/foreign language vocabulary learning have been based on evidence from laboratory experiments and have primarily involved the use of English key words to learn the vocabularies of other languages. Furthermore, comparatively few such studies have been conducted in authentic classroom contexts. The present study inquired into the eff...

Codeswitching language identification using Subword Information Enriched Word Vectors

Codeswitching is a widely observed phenomenon among bilingual speakers. By combining subword-information-enriched word vectors with a linear-chain Conditional Random Field, we develop a supervised machine learning model that identifies languages in English-Spanish codeswitched tweets. Our computational method achieves a tweet-level weighted F1 of 0.83 and a token-level accuracy of 0.949 without...
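As background on that recipe, here is a minimal sketch of feeding word-vector features into a linear-chain CRF, assuming the third-party sklearn-crfsuite package and toy two-dimensional embeddings standing in for the subword-enriched vectors used in the paper:

    import sklearn_crfsuite  # assumption: pip install sklearn-crfsuite

    # Toy 2-d "embeddings"; the paper uses subword-enriched word vectors.
    emb = {'i': [0.9, 0.1], 'love': [0.8, 0.2], 'tacos': [0.5, 0.5],
           'me': [0.1, 0.9], 'gusta': [0.05, 0.95]}

    def featurize(tokens):
        # One feature dict per token; crfsuite accepts numeric values.
        return [{'emb_%d' % i: v
                 for i, v in enumerate(emb.get(t, [0.0, 0.0]))}
                for t in tokens]

    X = [featurize(['i', 'love', 'tacos']),
         featurize(['me', 'gusta', 'tacos'])]
    y = [['en', 'en', 'en'], ['es', 'es', 'es']]  # token-level language tags

    crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=50)
    crf.fit(X, y)
    print(crf.predict([featurize(['me', 'gusta', 'love'])]))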

Words are not Equal: Graded Weighting Model for Building Composite Document Vectors

Despite the success of distributional semantics, composing phrases from word vectors remains an important challenge. Several methods have been tried for benchmark tasks such as sentiment classification, including word vector averaging, matrix-vector approaches based on parsing, and on-the-fly learning of paragraph vectors. Most models omit stop words from the composition. Instead of suc...
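For reference, the word vector averaging baseline mentioned above fits in a few lines; the inverse-frequency down-weighting used here is an illustrative assumption (one simple way to avoid deleting stop words outright), not the paper's graded weighting model:

    import numpy as np

    def doc_vector(tokens, vectors, freq, a=1e-3):
        # Weighted average of word vectors: frequent (stop-like) words are
        # down-weighted rather than removed from the composition.
        dim = len(next(iter(vectors.values())))
        acc, n = np.zeros(dim), 0
        for tok in tokens:
            if tok in vectors:
                acc += (a / (a + freq.get(tok, 0.0))) * vectors[tok]
                n += 1
        return acc / max(n, 1)

    # Toy example: 'the' is frequent, so it contributes little.
    vectors = {'the': np.array([0.3, 0.3]),
               'good': np.array([0.2, 0.8]),
               'movie': np.array([0.5, 0.1])}
    freq = {'the': 0.05, 'good': 0.002, 'movie': 0.001}
    print(doc_vector(['the', 'good', 'movie'], vectors, freq))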

Generating the Pseudo-Powers of a Word

The notions of power of a word, periodicity and primitivity are intrinsically connected to the operation of catenation, which dynamically generates word repetitions. When considering generalizations of the power of a word, other operations will be the ones that dynamically generate such pseudo-repetitions. In this paper we define and investigate the operation of θ-catenation, which gives rise to the...
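The truncated abstract does not reproduce the paper's definition of θ-catenation; as background only, here is a sketch of one common notion of pseudo-repetition it generalizes, where each block of a pseudo-power is either u or θ(u), and θ is assumed (for illustration) to be the DNA reverse complement, an antimorphic involution:

    from itertools import product

    def theta(w):
        # Antimorphic involution used as an example: DNA reverse complement.
        comp = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
        return ''.join(comp[c] for c in reversed(w))

    def pseudo_powers(u, n):
        # All pseudo-n-th powers of u: n blocks, each equal to u or theta(u).
        return sorted({''.join(p) for p in product((u, theta(u)), repeat=n)})

    print(pseudo_powers('ACG', 2))
    # ['ACGACG', 'ACGCGT', 'CGTACG', 'CGTCGT']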

Journal:
  • CoRR

Volume: abs/1802.06893

Publication date: 2018